Topic: Why can the PCA algorithm on the NanoRam 785 differentiate between calcium nitrate and magnesium nitrate, but the RVM algorithm on the NanoRam 1064 cannot?



Part One: 785 nm



Introduction



Principal Component Analysis (PCA) is a procedure that transforms a set of variables into a new set of mutually uncorrelated variables called principal components. Dimensionality reduction techniques such as PCA are often used in spectral classification and prediction because each spectrum contains a large number of variables. Each data point in a spectrum is treated as a variable, and reducing the number of variables is crucial to building a meaningful model. PCA uses an approach called feature extraction, which creates new features as linear combinations of the existing variables so that a smaller number of variables is enough to describe the dataset.
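
As a minimal sketch of the idea, the base R snippet below runs `prcomp` on synthetic "spectra" and checks that the resulting scores are uncorrelated. The data, dimensions, and variable names are illustrative assumptions, not the NanoRam measurements.

```r
# Synthetic stand-in for spectra: 20 scans (rows) x 50 pixels (columns).
# Real NanoRam exports are not used here; all names are illustrative.
set.seed(1)
n_scans <- 20
n_pixels <- 50
peak <- dnorm(seq_len(n_pixels), mean = 25, sd = 3)   # one shared band shape
intensity <- runif(n_scans, 0.5, 1.5)                 # per-scan band height
spectra <- outer(intensity, peak) +
  matrix(rnorm(n_scans * n_pixels, sd = 0.001), n_scans, n_pixels)

# Neighbouring pixels on the band rise and fall together (collinearity).
cor(spectra[, 24], spectra[, 26])

# PCA: centre the columns, then rotate onto orthogonal directions.
pca <- prcomp(spectra, center = TRUE, scale. = FALSE)
scores <- pca$x

# The new variables (scores) are mutually uncorrelated.
score_cor <- cor(scores[, 1:3])
max_off_diagonal <- max(abs(score_cor[upper.tri(score_cor)]))
max_off_diagonal
```

The original pixels are strongly correlated with their neighbours, while the off-diagonal correlations between the scores are zero up to floating-point error.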



Results



First, methods containing 20 scans each were built using the B&W Tek handheld Raman analyzer, NanoRam 785, for a single sample each of magnesium nitrate and calcium nitrate. Then the same nitrate samples were measured using a B&W Tek NanoRam 1064 handheld Raman analyzer to create a five-scan method for each sample. The raw data was processed with dark subtraction, scaling, and centering. Running PCA on the processed 785 nm spectral data yields the following:





Scree Plots

Scree plots show the eigenvalues of the principal components. The first plot shows the raw variance (eigenvalue) on the y axis and the principal components on the x axis, in descending order of eigenvalue:





The second plot shows the percentage of the total variance explained by each principal component, again in descending order of eigenvalue:





Statistical Summary

The information in the preceding graphs is also shown numerically in the table below:



## Importance of first k=5 (out of 40) components:
##                            PC1     PC2     PC3     PC4    PC5
## Standard deviation     22.4054 4.68831 3.55776 2.43046 2.2113
## Proportion of Variance  0.8626 0.03777 0.02175 0.01015 0.0084
## Cumulative Proportion   0.8626 0.90031 0.92206 0.93221 0.9406
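
The three rows of the summary are tied together by simple arithmetic: the variance of each component is its standard deviation squared, the proportion is that variance divided by the total, and the cumulative row is a running sum. The sketch below reproduces the rows by hand on synthetic data, assuming the table came from `summary()` on a `prcomp` object (the output format matches); the matrix `x` is made up for illustration.

```r
set.seed(2)
x <- matrix(rnorm(40 * 100), nrow = 40, ncol = 100)  # 40 scans x 100 pixels, synthetic
pca <- prcomp(x, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                    # variance along each PC (first scree plot)
prop_var <- eigenvalues / sum(eigenvalues)   # share of total variance (second scree plot)
cum_var <- cumsum(prop_var)                  # running total of the shares

manual <- rbind(
  "Standard deviation" = pca$sdev,
  "Proportion of Variance" = prop_var,
  "Cumulative Proportion" = cum_var
)

# summary() reports the same three rows (proportions rounded to 5 decimals).
reported <- summary(pca)$importance
round(manual[, 1:5], 4)
```

Applied to the table above, the PC1 standard deviation of 22.4054 corresponds to a variance of about 502, which is 86.26% of the total.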



Confidence Ellipses

A 95% confidence ellipse was constructed for each class. If samples were drawn repeatedly from the underlying distribution and a confidence ellipse were calculated each time, 95% of the constructed ellipses would contain the underlying mean of the distribution.
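
One way to compute such an ellipse for a class mean is Hotelling's T-squared region, sketched below in base R. This assumes the PC1/PC2 scores of a class are roughly bivariate normal; the function name `mean_confidence_ellipse` and the simulated scores are hypothetical, and this is not necessarily how the original plots were produced.

```r
# 95% confidence ellipse for the mean of a two-column score matrix,
# assuming the scores are roughly bivariate normal (Hotelling's T-squared).
mean_confidence_ellipse <- function(scores, level = 0.95, n_points = 200) {
  n <- nrow(scores)
  p <- ncol(scores)                      # 2 for a PC1/PC2 plot
  centre <- colMeans(scores)
  s <- cov(scores)
  # Squared radius in Mahalanobis units for the mean of n observations.
  radius2 <- p * (n - 1) / (n * (n - p)) * qf(level, p, n - p)
  angles <- seq(0, 2 * pi, length.out = n_points)
  circle <- rbind(cos(angles), sin(angles))
  # t(chol(s)) maps the unit circle onto the covariance shape.
  outline <- t(centre + sqrt(radius2) * t(chol(s)) %*% circle)
  list(centre = centre, cov = s, radius2 = radius2, outline = outline)
}

# Simulated scores for one class (20 scans), for illustration only.
set.seed(3)
class_scores <- cbind(PC1 = rnorm(20, mean = 5, sd = 2),
                      PC2 = rnorm(20, mean = -1, sd = 0.5))
ellipse <- mean_confidence_ellipse(class_scores)
head(ellipse$outline)
```

Passing `ellipse$outline` to `lines()` on a PC1/PC2 scatter plot draws the region. Note that some plotting helpers draw a data ellipse (covering roughly 95% of the points) instead, which is wider than a confidence ellipse for the mean.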



Conclusion

The calcium nitrate samples had low within-class variance, but the magnesium nitrate samples had a few massive outliers that significantly raised the variance and heavily affected the PC construction. Furthermore, the between-class variance is fairly small: the two main clusters are right next to each other. Can we do better?





Part Two: 785 adjusted

Introduction



Whether to scale before running PCA is a seemingly contentious issue in the data science and spectroscopy literature. There is a general consensus that scaling is required when variables have different units or greatly differing standard deviations. However, the variables in spectral data are all relative intensities at particular pixels, so they share the same units and are measured on the same scale. When that is the case, it is acceptable, and potentially even advantageous, to leave the variables unscaled and perform the PCA on the covariance matrix. Otherwise, scaling the variables and performing the PCA on the correlation matrix makes more sense.
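
The difference between the two choices can be sketched with `prcomp`, where `scale. = FALSE` works from the covariance matrix and `scale. = TRUE` from the correlation matrix. The synthetic spectra below (one strong band plus detector noise) are an assumption for illustration only.

```r
set.seed(4)
n_scans <- 30
n_pixels <- 60
pixel <- seq_len(n_pixels)
band <- 1000 * dnorm(pixel, mean = 30, sd = 2)     # one strong band
height <- runif(n_scans, 0.8, 1.2)                 # per-scan band height
spectra <- outer(height, band) +
  matrix(rnorm(n_scans * n_pixels, sd = 1), n_scans, n_pixels)  # detector noise

pca_cov <- prcomp(spectra, center = TRUE, scale. = FALSE)  # covariance matrix
pca_cor <- prcomp(spectra, center = TRUE, scale. = TRUE)   # correlation matrix

# Share of variance carried by PC1 under each choice.
share_cov <- pca_cov$sdev[1]^2 / sum(pca_cov$sdev^2)
share_cor <- pca_cor$sdev[1]^2 / sum(pca_cor$sdev^2)
c(covariance = share_cov, correlation = share_cor)

# After scaling, every pixel has unit variance, so total variance = pixel count.
sum(pca_cor$sdev^2)
```

Scaling gives noise-only pixels the same weight as pixels on the band, so in this toy case PC1 captures a smaller share of the variance after scaling than before.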



Outliers can also have a substantial impact on the construction of principal components. In part one, it was immediately obvious that an outlier in the magnesium nitrate group was skewing the results of the PCA. Viewing the spectra in the spectroscopy software BWIQ shows that magnesium nitrate scan #12 is vastly different from the other magnesium nitrate scans, most likely because of human error when scanning the sample. Two other potential outliers appear in the PCA graph from part one, but spectrally they are not dissimilar to the other scans in the method, and after running PCA again with the largest outlier removed, those two scans are no longer problematic.
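
The visual check described above could be complemented by a numerical screen. The sketch below is one possible approach, not the procedure used here: it compares each scan with the pixel-wise median spectrum of its class and flags scans whose correlation is far below the rest. The simulated scans, the deliberately corrupted scan 12, and the cut-off of 3.5 are all illustrative assumptions.

```r
set.seed(5)
n_scans <- 20
pixel <- seq_len(80)
band <- 500 * dnorm(pixel, mean = 40, sd = 3)
scans <- t(replicate(n_scans, band + rnorm(length(pixel), sd = 0.5)))
# Simulate one bad acquisition: scan 12 loses the band and gains a sloped background.
scans[12, ] <- 0.3 * pixel + rnorm(length(pixel), sd = 0.5)

# Reference: pixel-wise median spectrum, robust to a single bad scan.
median_spectrum <- apply(scans, 2, median)

# Similarity of each scan to the reference.
similarity <- apply(scans, 1, function(s) cor(s, median_spectrum))

# Flag scans whose similarity is far below the rest (robust z-score;
# the cut-off of 3.5 is a common convention, chosen here as an assumption).
robust_z <- (similarity - median(similarity)) / mad(similarity)
flagged <- which(robust_z < -3.5)
flagged
```

Any flagged scan should still be inspected spectrally before removal, as was done for scan #12.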



Results



First, methods containing 20 scans each were built using the B&W Tek handheld Raman analyzer, NanoRam 785, for a single sample each of magnesium nitrate and calcium nitrate. The raw data was processed with dark subtraction and centering, without scaling. Magnesium nitrate sample scan #12 was removed from the dataset as an outlier. Running PCA on the processed 785 nm spectral data yields the following:





Scree Plots







Statistical Summary



## Importance of first k=5 (out of 39) components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     7784.6314 4094.0216 1.466e+03 730.90832 621.10275
## Proportion of Variance    0.7284    0.2015 2.582e-02   0.00642   0.00464
## Cumulative Proportion     0.7284    0.9299 9.557e-01   0.96215   0.96678



Confidence Ellipses







Interactive 3-Dimensional PCA Scores





Part Three: 1064 nm PCA

Introduction



Reduced Variable Multivariate (RVM) is another dimensionality reduction procedure. It takes a set of variables and deletes many of them until a small subset that accurately describes the dataset remains. RVM uses an approach called feature selection, which differs from feature extraction in a couple of ways. First, since variables are removed rather than transformed into new ones, feature selection discards the information in the deleted variables, whereas feature extraction builds its new variables from all of the original data. Second, feature selection gives no guarantee that the remaining variables are free of collinearity, while feature extraction with PCA does guarantee orthogonal, uncorrelated variables. Before the data collected on the 1064 nm device is analyzed with RVM, it is first run through the PCA model as a baseline for comparison.
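
The contrast between the two approaches can be sketched in base R. The selection rule below (keep the pixels with the largest variance) is a hypothetical stand-in, since the actual RVM selection rule is not described here; the synthetic spectra are also assumptions.

```r
set.seed(6)
n_scans <- 12
pixel <- seq_len(60)
band <- 300 * dnorm(pixel, mean = 30, sd = 3)
height <- runif(n_scans, 0.7, 1.3)
spectra <- outer(height, band) +
  matrix(rnorm(n_scans * length(pixel), sd = 0.2), n_scans, length(pixel))

k <- 3

# Feature selection (stand-in rule, NOT the RVM rule):
# keep the k pixels with the largest variance and drop the rest.
pixel_variance <- apply(spectra, 2, var)
kept_pixels <- order(pixel_variance, decreasing = TRUE)[seq_len(k)]
selected <- spectra[, kept_pixels]

# Feature extraction: keep the first k principal component scores.
extracted <- prcomp(spectra, center = TRUE, scale. = FALSE)$x[, seq_len(k)]

# Largest absolute pairwise correlation among the columns of a matrix.
largest_abs_cor <- function(m) {
  r <- cor(m)
  max(abs(r[upper.tri(r)]))
}
c(selected = largest_abs_cor(selected), extracted = largest_abs_cor(extracted))
```

In this toy case the selected pixels all sit on the same band and remain strongly correlated with one another, while the PCA scores are uncorrelated by construction.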



Results





There don’t seem to be any outliers present and, following the lesson of the previous sections, the variables are not scaled in the preprocessing steps.



Scree Plots





Statistical Summary



## Importance of first k=3 (out of 12) components:
##                              PC1       PC2       PC3
## Standard deviation     3.019e+04 1.356e+04 3.713e+03
## Proportion of Variance 8.159e-01 1.645e-01 1.234e-02
## Cumulative Proportion  8.159e-01 9.803e-01 9.927e-01



Confidence Ellipses



Currently the ellipses are not being drawn…need to fix error





Interactive 3-Dimensional PCA Scores





Conclusion



The PCA algorithm has no trouble distinguishing between the two kinds of nitrates in the data from the 1064 nm device either. This would point to a flaw in the RVM algorithm, as the RVM method had specificity issues when examining the nitrate samples on the 1064 nm device. Further investigation will be conducted to uncover the issues the 1064 nm device is having with its current algorithm.



Part Four: 1064 nm RVM



Introduction



I’m going to try to build an RVM algorithm from scratch, since everything is so top secret here. Let’s see how it turns out. All I have to go on is that RVM uses feature selection, so there is probably collinearity between the remaining variables, and that a variable inflation factor is applied, which definitely causes a decrease in specificity (more false positives). There appears to be no way to alter the default inflation factor of 5 on the device, although the internal paper mentions that it can be lowered to deal with false positive issues.



Session Info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] factoextra_1.0.6  ggfortify_0.4.8   data.table_1.12.2
##  [4] plyr_1.8.4        forcats_0.4.0     stringr_1.4.0    
##  [7] dplyr_0.8.3       purrr_0.3.2       readr_1.3.1      
## [10] tidyr_1.0.0       tibble_2.1.3      tidyverse_1.2.1  
## [13] plotly_4.9.1      ggplot2_3.2.1    
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.1     Rcpp_1.0.2        lubridate_1.7.4  
##  [4] lattice_0.20-38   assertthat_0.2.1  zeallot_0.1.0    
##  [7] digest_0.6.21     mime_0.7          R6_2.4.0         
## [10] cellranger_1.1.0  backports_1.1.4   evaluate_0.14    
## [13] httr_1.4.1        pillar_1.4.2      rlang_0.4.0      
## [16] lazyeval_0.2.2    readxl_1.3.1      rstudioapi_0.10  
## [19] rmarkdown_1.15    labeling_0.3      htmlwidgets_1.3  
## [22] munsell_0.5.0     shiny_1.3.2       broom_0.5.2      
## [25] compiler_3.6.1    httpuv_1.5.2      modelr_0.1.5     
## [28] xfun_0.9          pkgconfig_2.0.3   htmltools_0.3.6  
## [31] tidyselect_0.2.5  gridExtra_2.3     viridisLite_0.3.0
## [34] crayon_1.3.4      withr_2.1.2       ggpubr_0.2.4     
## [37] later_0.8.0       grid_3.6.1        xtable_1.8-4     
## [40] nlme_3.1-140      jsonlite_1.6      gtable_0.3.0     
## [43] lifecycle_0.1.0   magrittr_1.5      scales_1.0.0     
## [46] cli_1.1.0         stringi_1.4.3     ggsignif_0.6.0   
## [49] promises_1.0.1    xml2_1.2.2        generics_0.0.2   
## [52] vctrs_0.2.0       tools_3.6.1       glue_1.3.1       
## [55] hms_0.5.1         crosstalk_1.0.0   yaml_2.2.0       
## [58] colorspace_1.4-1  rvest_0.3.4       knitr_1.25       
## [61] haven_2.1.1